Methodology

  1. Import data
  2. EDA
  3. Check Null Values & use appropriate fillers
  4. Check the distribution of Class variable
  5. Implement a Label Encoder
  6. Check the variables distribution- Bivariate Analysis
  7. Check the Correlation between the variables
  8. Prepare X and Y
  9. Split the data into Train and Test
  10. Train an SVM and get accuracy
  11. Use PCA and capture 95% of the variance in the data
  12. Find the accuracy of the model using PCA
  13. Compare the accuracy of SVM using raw data and SVM using PCA
  14. Scale X
  15. Repeat steps 9-13 above using the scaled X
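The steps above can be sketched end to end as a single scikit-learn pipeline. This is a minimal illustration on synthetic stand-in data (the actual notebook works on vehicle.csv and runs each step separately):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle data: 846 rows, 18 numeric features, 3 classes
rng = np.random.RandomState(25)
X = rng.normal(size=(846, 18))
y = rng.randint(0, 3, size=846)

# Scale -> keep 95% of the variance with PCA -> fit an SVM (steps 9-15 above)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("svm", SVC(random_state=25)),
])

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=25)
pipe.fit(x_train, y_train)
print(pipe.score(x_test, y_test))
```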
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')
In [2]:
#reading the data
data = pd.read_csv("vehicle.csv")
In [3]:
data.head() #checking the head
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
data.shape #checking the shape of the data
Out[4]:
(846, 19)
In [5]:
data.describe().T 
Out[5]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [6]:
data.dtypes
Out[6]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [7]:
data.skew()
Out[7]:
compactness                    0.381271
circularity                    0.261809
distance_circularity           0.106585
radius_ratio                   0.394978
pr.axis_aspect_ratio           3.830362
max.length_aspect_ratio        6.778394
scatter_ratio                  0.607271
elongatedness                  0.047847
pr.axis_rectangularity         0.770889
max.length_rectangularity      0.256359
scaled_variance                0.651598
scaled_variance.1              0.842034
scaled_radius_of_gyration      0.279317
scaled_radius_of_gyration.1    2.083496
skewness_about                 0.776519
skewness_about.1               0.688017
skewness_about.2               0.249321
hollows_ratio                 -0.226341
dtype: float64

Observations

  1. There are 846 records with 19 columns
  2. In the describe output the count is below 846 for several columns, which indicates missing values
  3. 18 columns are numeric while the class column is of type object
  4. max.length_aspect_ratio, pr.axis_aspect_ratio and scaled_radius_of_gyration.1 are highly skewed, while the other columns are only marginally skewed; hollows_ratio is negatively skewed
In [8]:
# Checking for missing data
data.isnull().sum()
Out[8]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

Observations

Most of the variables have missing data; only compactness, max.length_aspect_ratio, max.length_rectangularity, hollows_ratio and class are complete.

In [9]:
# checking the median of the data

data.median()
Out[9]:
compactness                     93.0
circularity                     44.0
distance_circularity            80.0
radius_ratio                   167.0
pr.axis_aspect_ratio            61.0
max.length_aspect_ratio          8.0
scatter_ratio                  157.0
elongatedness                   43.0
pr.axis_rectangularity          20.0
max.length_rectangularity      146.0
scaled_variance                179.0
scaled_variance.1              363.5
scaled_radius_of_gyration      173.5
scaled_radius_of_gyration.1     71.5
skewness_about                   6.0
skewness_about.1                11.0
skewness_about.2               188.0
hollows_ratio                  197.0
dtype: float64
In [10]:
#dealing with missing data

newData = data.copy() # copy so the original df stays intact (plain assignment would only alias it)

#define a median filler
medianFiller = lambda x: x.fillna(x.median())

#apply the filler
newData.iloc[:,0:18] = data.iloc[:,0:18].apply(medianFiller)
In [11]:
#checking for missing data in the new dataframe

newData.isnull().sum()
Out[11]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

Observations

All the missing values have been replaced with the column medians; the data frame now has no missing values.

Understanding the class variable

In [12]:
newData['class'].value_counts()
Out[12]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [13]:
newData['class'].value_counts(1)*100
Out[13]:
car    50.709220
bus    25.768322
van    23.522459
Name: class, dtype: float64

Observations

  1. There are three classes (vehicle types): car, bus and van
  2. The distribution is roughly 51% car, 26% bus and 24% van

Next Steps

Since class is currently an object (string) column, it is hard to relate the distribution of the other columns to it directly, so it is converted to a numeric type using an encoder.
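
As a quick standalone illustration of what the label encoder does (classes_ stores the mapping in alphabetical order, so bus maps to 0, car to 1 and van to 2, matching the value counts seen after encoding below):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["van", "van", "car", "van", "bus"])  # first rows of the data

print(le.classes_.tolist())                  # ['bus', 'car', 'van'] (alphabetical)
print(codes.tolist())                        # [2, 2, 1, 2, 0]
print(le.inverse_transform(codes).tolist())  # recovers the original string labels
```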

In [14]:
newData.dtypes # checking the datatype before encoding
Out[14]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [15]:
# using a Label Encoder

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

data_le = newData.copy() # encoded df (copied so newData is not modified in place)
data_le['class'] = le.fit_transform(data_le['class'])
In [16]:
data_le.dtypes # checking the df after encoding
Out[16]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                            int64
dtype: object
In [17]:
data_le['class'].value_counts()
Out[17]:
1    429
0    218
2    199
Name: class, dtype: int64
In [18]:
sns.pairplot(data_le, hue = 'class')
Out[18]:
<seaborn.axisgrid.PairGrid at 0x7ff045d2b0d0>

Observations

Listing only a few observations:

  1. Bus, van and car are represented by the blue, green and orange colors on the graph
  2. scaled_radius_of_gyration.1, pr.axis_aspect_ratio, max.length_aspect_ratio and radius_ratio have significant outliers, as seen from their distributions
  3. max.length_aspect_ratio has multiple peaks in its distribution, similar to scaled_variance.1
  4. A few feature pairs show a clear correlation in the scatter plots, e.g. skewness_about.2 vs hollows_ratio and skewness_about.1 vs hollows_ratio
  5. elongatedness and scaled_variance.1 have a very strong negative correlation
  6. Most of the other distributions also contain outliers
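The outliers noted above could also be counted with a simple 1.5 x IQR rule. A sketch on synthetic data (the column name and values are placeholders loosely modelled on radius_ratio):

```python
import numpy as np
import pandas as pd

# Mostly normal values plus a handful of extreme ones, mimicking radius_ratio
rng = np.random.RandomState(25)
values = np.append(rng.normal(168, 33, 840), [320, 330, 333, 310, 305, 333])
df = pd.DataFrame({"radius_ratio": values})

# Flag anything outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR
q1, q3 = df["radius_ratio"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["radius_ratio"] < q1 - 1.5 * iqr) | (df["radius_ratio"] > q3 + 1.5 * iqr)]
print(len(outliers))
```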
In [19]:
# Checking for Correlation

corr = data_le.corr()

mask = np.zeros_like(corr, dtype = bool)  # np.bool is removed in newer NumPy
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize = (15, 15))

cmap = sns.diverging_palette(220, 10, as_cmap = True)

sns.heatmap(corr, mask = mask, cmap = cmap, vmax = 1, center = 0, square = True, 
            linewidths = .5, cbar_kws = {"shrink": .5}, annot = True)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff036f08410>

Observations

  1. Only 5 features have a positive correlation with class, and none of them is really strong
  2. The remaining 13 features are negatively correlated to varying degrees
  3. elongatedness, hollows_ratio and max.length_aspect_ratio are the most correlated with class (in that order) and could be considered on their own for building a base model
  4. At the same time, hollows_ratio has a strong correlation with skewness_about.2 and a fair correlation with radius_ratio and compactness, so these can also be considered when building the model
  5. max.length_aspect_ratio has a fairly strong correlation with pr.axis_aspect_ratio, followed by weaker correlations with distance_circularity, circularity and compactness

So there is a fair bit of correlation among the features, which motivates including most of them in the base model.
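
The strongest pairs in the heatmap can also be listed programmatically. A sketch on synthetic data (three placeholder columns, with one engineered strong negative pair to mirror elongatedness vs scaled_variance.1):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(25)
a = rng.normal(size=200)
df = pd.DataFrame({
    "elongatedness": a,
    "scaled_variance.1": -a + rng.normal(scale=0.1, size=200),  # strong negative corr
    "skewness_about": rng.normal(size=200),
})

# Keep the upper triangle of the correlation matrix and rank pairs by |corr|
corr = df.corr()
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
top = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top.head(3))
```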

Version 1 - Dropping only the class column

In [20]:
# preparing data - dropping only the class column

from sklearn.model_selection import train_test_split

X = data_le.drop(columns = ['class'])     # Predictor feature columns
Y = data_le['class']   # Predicted class
In [21]:
#preparing the training and testing data set

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=25)

#check the data split

print("{0:0.2f}% data is in training set".format((len(x_train)/len(data_le.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_test)/len(data_le.index)) * 100))
69.98% data is in training set
30.02% data is in test set
In [22]:
#SVM - non scaled version

from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score


SVM = svm.SVC(random_state=25)    
SVM.fit(x_train, y_train)

SVM_train_predict = SVM.predict(x_train)
SVM_test_predict = SVM.predict(x_test)

print('SVM accuracy for train set: {0:.3f}'.format(SVM.score(x_train, y_train)))
print('SVM accuracy for test set: {0:.3f}'.format(SVM.score(x_test, y_test)))

# Classification Report
print('\n{}'.format(classification_report(y_test, SVM_test_predict)))

# Accuracy Score
acc = accuracy_score(y_test, SVM_test_predict)  # accuracy, not AUC
print('\nAccuracy Score:', acc.round(3))
SVM accuracy for train set: 0.681
SVM accuracy for test set: 0.697

              precision    recall  f1-score   support

           0       0.58      0.64      0.61        59
           1       0.83      0.75      0.79       130
           2       0.58      0.63      0.60        65

    accuracy                           0.70       254
   macro avg       0.66      0.68      0.67       254
weighted avg       0.71      0.70      0.70       254


Accuracy Score: 0.697
In [23]:
# Performing Cross Validation

from sklearn.model_selection import cross_val_score

svm_cvs = cross_val_score(svm.SVC(random_state=25), X, Y)
svm_cvs_mean = svm_cvs.mean()
svm_cvs_mean
Out[23]:
0.6808980160111382
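
When no cv argument is given, cross_val_score uses 5-fold splitting, stratified by class for classifiers. A minimal sketch on synthetic data that makes the default explicit:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(25)
X = rng.normal(size=(300, 5))
y = rng.randint(0, 3, size=300)

# Equivalent to the default cv=None/cv=5 for a classifier, spelled out
scores = cross_val_score(SVC(random_state=25), X, y, cv=StratifiedKFold(n_splits=5))
print(scores.mean())
```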

Version 2 - Dropping class and other non-influencing columns

In [24]:
# preparing data - dropping columns which don't seem to influence the target variable

Xnew = data_le.drop(columns = ['class', 'skewness_about', 'skewness_about.1', 'skewness_about.2', 'scaled_radius_of_gyration.1'])     # Predictor feature columns
Ynew = data_le['class']   # Predicted class
In [25]:
#preparing the training and testing data set

x_trainN, x_testN, y_trainN, y_testN = train_test_split(Xnew, Ynew, test_size=0.3, random_state=25)

#check the data split

print("{0:0.2f}% data is in training set".format((len(x_trainN)/len(data_le.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_testN)/len(data_le.index)) * 100))
69.98% data is in training set
30.02% data is in test set
In [26]:
#SVM - non scaled version and columns dropped 

from sklearn import svm
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score


SVMN = svm.SVC(random_state=25)    
SVMN.fit(x_trainN, y_trainN)

SVMN_train_predict = SVMN.predict(x_trainN)
SVMN_test_predict = SVMN.predict(x_testN)

print('SVM accuracy for train set: {0:.3f}'.format(SVMN.score(x_trainN, y_trainN)))
print('SVM accuracy for test set: {0:.3f}'.format(SVMN.score(x_testN, y_testN)))

# Classification Report
print('\n{}'.format(classification_report(y_testN, SVMN_test_predict)))

# Accuracy Score
accNew = accuracy_score(y_testN, SVMN_test_predict)  # accuracy, not AUC
print('\nAccuracy Score:', accNew.round(3))
SVM accuracy for train set: 0.684
SVM accuracy for test set: 0.689

              precision    recall  f1-score   support

           0       0.57      0.64      0.60        59
           1       0.83      0.75      0.79       130
           2       0.57      0.62      0.59        65

    accuracy                           0.69       254
   macro avg       0.66      0.67      0.66       254
weighted avg       0.70      0.69      0.69       254


Accuracy Score: 0.689
In [27]:
# Performing Cross Validation

svmN_cvs = cross_val_score(svm.SVC(random_state=25), Xnew, Ynew)
svmN_cvs_mean = svmN_cvs.mean()
svmN_cvs_mean
Out[27]:
0.6797076226940482

Observations

There is no significant improvement in the accuracy of the model after dropping more features.

Performing PCA - non scaled features

In [28]:
from sklearn.decomposition import PCA

pca = PCA(n_components=18) # no. of components = no. of features
pca.fit(X)

# print(pca.explained_variance_)
# print(pca.components_)
Out[28]:
PCA(n_components=18)
In [29]:
print(pca.explained_variance_ratio_)
[9.58285488e-01 1.82449163e-02 1.22165579e-02 3.97175689e-03
 2.05200685e-03 1.34631794e-03 1.21078265e-03 7.84099280e-04
 6.41282084e-04 3.86729714e-04 3.20443581e-04 2.02861904e-04
 1.37698997e-04 8.04741042e-05 4.92405216e-05 3.44218306e-05
 3.20409732e-05 2.88028725e-06]
In [30]:
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variance explained')
plt.xlabel('Principal component')
plt.show()
In [31]:
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Principal component')
plt.show()

Observations

The first component alone accounts for more than 95% of the variance, so only one component is carried forward for building the model.
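
Rather than reading the threshold off the plot, the component count can be chosen automatically: PCA accepts a float n_components and keeps just enough components to reach that fraction of the variance. A sketch on synthetic data with one dominant direction, as with the unscaled vehicle data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(25)
# Every column is a scaled copy of one latent factor plus small noise,
# so a single component dominates the variance
X = np.outer(rng.normal(size=500), np.ones(18) * 50) + rng.normal(size=(500, 18))

pca = PCA(n_components=0.95)  # keep >= 95% of the variance
pca.fit(X)
print(pca.n_components_)                    # 1
print(pca.explained_variance_ratio_.sum())  # >= 0.95
```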

In [32]:
pca = PCA(n_components=1)
pca.fit(X)
# print(pca.components_)
# print(pca.explained_variance_ratio_)
Xpca = pca.transform(X)

Fitting an SVM using the Principal Components

In [34]:
# using principal components to fit an SVM model

model_pca = svm.SVC(random_state=25)
model_pca.fit(Xpca, Y)
model_pca.score(Xpca, Y)
Out[34]:
0.6335697399527187

Version 3 - Scaling the features and fitting an SVM

In [35]:
# scaling the features

from scipy.stats import zscore
XScaled=X.apply(zscore)
XScaled.head()
Out[35]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
0 0.160580 0.518073 0.057177 0.273363 1.310398 0.311542 -0.207598 0.136262 -0.224342 0.758332 -0.401920 -0.341934 0.285705 -0.327326 -0.073812 0.380870 -0.312012 0.183957
1 -0.325470 -0.623732 0.120741 -0.835032 -0.593753 0.094079 -0.599423 0.520519 -0.610886 -0.344578 -0.593357 -0.619724 -0.513630 -0.059384 0.538390 0.156798 0.013265 0.452977
2 1.254193 0.844303 1.519141 1.202018 0.548738 0.311542 1.148719 -1.144597 0.935290 0.689401 1.097671 1.109379 1.392477 0.074587 1.558727 -0.403383 -0.149374 0.049447
3 -0.082445 -0.623732 -0.006386 -0.295813 0.167907 0.094079 -0.750125 0.648605 -0.610886 -0.344578 -0.912419 -0.738777 -1.466683 -1.265121 -0.073812 -0.291347 1.639649 1.529056
4 -1.054545 -0.134387 -0.769150 1.082192 5.245643 9.444962 -0.599423 0.520519 -0.610886 -0.275646 1.671982 -0.648070 0.408680 7.309005 0.538390 -0.179311 -1.450481 -1.699181
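
As a side note, apply(zscore) computes the same standardisation as scikit-learn's StandardScaler, i.e. z = (x - mean) / std with the population standard deviation. A small sketch on synthetic data confirming the equivalence:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(25)
X = pd.DataFrame(rng.normal(loc=100, scale=20, size=(50, 3)), columns=["a", "b", "c"])

via_zscore = X.apply(zscore)                     # column-wise z-scores
via_scaler = StandardScaler().fit_transform(X)   # same formula, ddof=0
print(np.allclose(via_zscore.values, via_scaler))  # True
```

StandardScaler has the practical advantage that it can be fit on the training split only and then applied to the test split, avoiding leakage of test-set statistics.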
In [36]:
#preparing the training and testing data set for scaled data

x_trainS, x_testS, y_trainS, y_testS = train_test_split(XScaled, Y, test_size=0.3, random_state=25)

#check the data split

print("{0:0.2f}% data is in training set".format((len(x_trainS)/len(data_le.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(x_testS)/len(data_le.index)) * 100))
69.98% data is in training set
30.02% data is in test set
In [37]:
#SVM - scaled version

SVMScaled = svm.SVC(random_state=25)    
SVMScaled.fit(x_trainS, y_trainS)

SVM_train_predictS = SVMScaled.predict(x_trainS)
SVM_test_predictS = SVMScaled.predict(x_testS)

print('SVM accuracy for train set: {0:.3f}'.format(SVMScaled.score(x_trainS, y_trainS)))
print('SVM accuracy for test set: {0:.3f}'.format(SVMScaled.score(x_testS, y_testS)))

# Classification Report
print('\n{}'.format(classification_report(y_testS, SVM_test_predictS)))

# Accuracy Score
accS = accuracy_score(y_testS, SVM_test_predictS)  # accuracy, not AUC
print('\nAccuracy Score:', accS.round(3))
SVM accuracy for train set: 0.980
SVM accuracy for test set: 0.972

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        59
           1       1.00      0.96      0.98       130
           2       0.90      1.00      0.95        65

    accuracy                           0.97       254
   macro avg       0.97      0.98      0.97       254
weighted avg       0.98      0.97      0.97       254


Accuracy Score: 0.972
In [38]:
# Performing Cross Validation

svmS_cvs = cross_val_score(svm.SVC(random_state=25), XScaled, Y)
svmS_cvs_mean = svmS_cvs.mean()
svmS_cvs_mean
Out[38]:
0.9657361642882005

Performing PCA - Scaled features

In [39]:
pcaS = PCA(n_components=18)
pcaS.fit(XScaled)
Out[39]:
PCA(n_components=18)
In [40]:
plt.bar(list(range(1,19)),pcaS.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variance explained')
plt.xlabel('Principal component')
plt.show()
In [41]:
plt.step(list(range(1,19)),np.cumsum(pcaS.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variance explained')
plt.xlabel('Principal component')
plt.show()
In [42]:
# Considering 6 components which represent close to 95% of the variance 

pcaS = PCA(n_components=6)
pcaS.fit(XScaled)
#print(pcaS.components_)
print(pcaS.explained_variance_ratio_)
[0.52186034 0.16729768 0.10562639 0.0654746  0.05089869 0.02996413]
In [43]:
XpcaS = pcaS.transform(XScaled)
In [44]:
model_pcaS = svm.SVC(random_state=25)
model_pcaS.fit(XpcaS, Y)
model_pcaS.score(XpcaS, Y)
Out[44]:
0.9066193853427896

Final Observations and Conclusions

Comparing the accuracies across the runs (test-set, cross-validated, or full-data scores as noted)

  1. Accuracy of Base SVM model (test set): 0.697
  2. Accuracy of Base SVM Model + Cross Validation: 0.681
  3. Accuracy of Base SVM Model + PCA (1 component, scored on all data): 0.634
  4. Accuracy of Base SVM Model + Scaled Features (test set): 0.972
  5. Accuracy of Base SVM Model + Scaled Features + Cross Validation: 0.966
  6. Accuracy of Base SVM Model + Scaled Features + PCA (6 components, scored on all data): 0.907

Reducing the unscaled data to a single principal component lowers the accuracy by about 5 percentage points, while reducing the scaled features to 6 components costs around 6 percentage points. The scaled model performed best overall: the SVM reaches about 97% test accuracy on the scaled features, and still around 90% after PCA.
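
For a side-by-side view, the comparison can be tabulated (the accuracy values are copied from the cells above):

```python
import pandas as pd

# Accuracies reported in the runs above
results = pd.DataFrame({
    "model": ["SVM", "SVM + CV", "SVM + PCA(1)",
              "SVM scaled", "SVM scaled + CV", "SVM scaled + PCA(6)"],
    "accuracy": [0.697, 0.681, 0.634, 0.972, 0.966, 0.907],
})
print(results.sort_values("accuracy", ascending=False).to_string(index=False))
```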
